Large scale statistical analysis of GEO datasets

Authors

  • Bernard Ycart
  • Konstantina Charmpi
  • Sophie Rousseaux
  • Jean-Jacques Fournié
Abstract

The problem addressed here is the simultaneous treatment of several gene expression datasets, possibly collected under different experimental conditions and/or platforms. Using robust statistics, a large scale statistical analysis has been conducted over 20 datasets downloaded from the Gene Expression Omnibus repository. The differences between datasets are compared to the variability inside a given dataset. Evidence is provided that meaningful biological information can be extracted by merging different sources.

Background

Many genomewide expression datasets have been published during the past ten years. Repositories such as the Gene Expression Omnibus (GEO) database [1] have made available an impressive wealth of data. Using them as a whole, instead of restricting statistical studies to one particular dataset, is tantalizing. Two recently published R/Bioconductor packages [2, 3] provide various tools for merging datasets coming from different studies. However, Haibe-Kains et al. [4] cast serious doubt on this practice after comparing two large scale pharmacogenomic studies: although the two studies showed a good overall correlation, important discordances could be observed. Thus, the following crucial question remains to be answered: is it statistically legitimate to merge datasets coming from different studies? An attempt at answering this question is reported here.

Merging different datasets requires first checking that the information they contain is compatible, and hence that detected differences between gene expressions under different conditions are not artifacts of experimental or data processing methods. An obvious obstacle to simultaneous treatment is that expression data collected under different experimental conditions and/or platforms usually have incompatible


Related resources

GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will cause adverse impacts on subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-cens...


Accuracy evaluation of different statistical and geostatistical censored data imputation approaches (Case study: Sari Gunay gold deposit)

Most geochemical datasets include missing data in varying proportions, and this may cause a significant problem in geostatistical modeling or multivariate analysis of the data. Therefore, it is common to impute the missing data in most geochemical studies. In this study, three approaches called half detection (HD), multiple imputation (MI), and the cosimulation based on Markov model 2...


Comparative Correlation Structure of Colon Cancer Locus Specific Methylation: Characterisation of Patient Profiles and Potential Markers across 3 Array-Based Datasets

Abnormal DNA methylation is well known to play an important role in cancer onset and development, and colon cancer is no exception to this rule. Recent years have seen the increased use of large-scale technologies (such as methylation microarray assays or specific sequencing of methylated DNA) to determine whole genome profiles of CpG island methylation in tissue samples. Comprehensive study ...


Similarity of markers identified from cancer gene expression studies: observations from GEO

Gene expression profiling has been extensively conducted in cancer research. The analysis of multiple independent cancer gene expression datasets may provide additional information and complement single-dataset analysis. In this study, we conduct multi-dataset analysis and are interested in evaluating the similarity of cancer-associated genes identified from different datasets. The first object...


Zoning Electrical Conductivity and Acidity of Groundwater through Using Geo-statistical Method: A Case Study in Semirom Plain, Esfahan Province

Groundwater quality research is important, and groundwater pollution control has been addressed in some of the research literature. Groundwater quality varies in space and time, so classical statistics cannot account for these variations in regional-scale research. This study used geo-statistical methods to optimize an interpolation method in order to estimate the spatial distribution of pH...




Publication date: 2014